ABSTRACT

A 15-stage pyramidal test and a 40-item two-stage
test were constructed and administered by computer to 111 college
undergraduates. The two-stage test was found to utilize a smaller
[...] from r=.79 to .84. Both adaptive strategies appeared to adapt item
[...] chance effects due to guessing. The pyramidal strategy seemed to be
slightly more successful in eliminating guessing than the two-stage
[...] conventional testing. Simulation studies are suggested to further
[...]

AN EMPIRICAL COMPARISON OF TWO-STAGE
AND PYRAMIDAL ADAPTIVE ABILITY TESTING



The administration of ability test items by means of an interactive
computer system has enabled test administrators to tailor or adapt tests to
individual differences in testee ability. Items are selected by a set of
rules or "strategy" determined prior to testing (see Weiss, 1974, for a
discussion of the various adaptive testing strategies). At one or more
points in the testing, a testee's responses to previously administered items
are evaluated, and a tentative estimate of ability is made. Subsequent
items are generally selected so that their difficulties are close to the
testee's estimated ability. This procedure permits testing time to be
shortened in comparison to conventional paper and pencil methods of testing
without reducing either the reliability or validity of the test. Computerized
adaptive testing also has other advantages over conventional tests (see
Weiss and Betz, 1973).

Empirical research on two adaptive strategies, the pyramidal test and 
the two-stage test, has been reported in the present series of research 
papers (Betz & Weiss, 1973; Larkin & Weiss, 1974). In both of these studies, 
the adaptive test was compared to a conventional test on a number of 
psychometric criteria. The present study directly compares the two adaptive 
strategies using the same group of subjects.

Pyramidal Tests 

The pyramidal testing method structures items into a triangular
configuration according to item difficulties. Item administration follows the
general branching rule that a more difficult item follows a correct response
while an easier item follows an incorrect response. Figure 1 illustrates a
typical pyramidal test. The first item administered is at the top of the
pyramidal structure and is usually one of median difficulty (proportion
correct, p=.50) based on previous item analyses. The second item administered
to any testee depends on whether his/her response to the first item is
correct or incorrect. If the testee answers the first item correctly, a
more difficult item (p=.45) is presented next. An item of lesser difficulty
(p=.55) is presented next if the initial item is answered incorrectly.
Thus, there are two items available at the second level or "stage" of the
pyramid. Branching to the third stage depends on the correctness of the
response to the second-stage item. This process is repeated until the
testee has attempted one item at each of a fixed number of stages.
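
The branching rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program used in the study: the function name `administer_pyramid`, the (stage, position) indexing of items, and the response callback are all hypothetical.

```python
from typing import Callable, List, Tuple

def administer_pyramid(
    n_stages: int,
    answer: Callable[[int, int], bool],
) -> List[Tuple[int, int, bool]]:
    """Walk one testee through an n-stage pyramid (hypothetical indexing).

    Stage s (1-based) holds s items; position 0 is the easiest item in the
    stage and position s - 1 the most difficult. `answer(stage, position)`
    returns True when the testee responds correctly to that item.
    """
    path = []
    position = 0  # the single item at the top of the pyramid
    for stage in range(1, n_stages + 1):
        correct = answer(stage, position)
        path.append((stage, position, correct))
        # Correct response: branch to the harder neighbor in the next stage.
        # Incorrect response: branch to the easier neighbor (same position).
        if correct:
            position += 1
    return path

# Hypothetical testee who answers the first two items correctly, then misses.
path = administer_pyramid(4, lambda stage, position: stage <= 2)
print(path)
```

For this hypothetical testee the path is (1, 0), (2, 1), (3, 2), (4, 2): two upward branches followed by two downward ones, with exactly one item attempted per stage.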

The increment in difficulty following a correct response in Figure 1 is 
equal to the decrement in difficulty following an incorrect response. Thus, 
branching within this pyramidal structure uses an "equal offset." Unequal 
offsets with smaller increments than decrements can be used as a correction 
for guessing (Weiss, 1974, p. 16). 

The number of items to be answered by any testee is small when compared
to the total number of items in the pyramidal structure. In general,
n(n+1)/2 items are needed to construct a pyramid of n stages when one item
is attempted at each stage.

Many ways of scoring pyramidal tests have been developed (see Weiss,
1974, pp. 30-34). The ranked difficulty of the final item has been used
as the individual's score (Bayroff, Thomas, & Anderson, 1960; Seeley, Morton,
& Anderson, 1962; Waters & Bayroff, 1971). Testees completing the pyramid
shown in Figure 1 could receive scores of from 1 to 10 under this scoring
method, since there are only ten items available at the tenth stage of
testing. The number of rank positions can be increased by assigning a
higher rank to those subjects answering the final item correctly than to
those who do not (Bayroff & Seeley, 1967; Waters, 1964). The difficulty
of the final item attempted has also been used to estimate an individual's
ability (Bayroff, 1969). Another scoring method branches the testee to a
hypothetical (n+1)th item following the final item and estimates its
difficulty (Hansen, 1969; Lord, 1971b; Weiss, 1974, p. 31). The difficulties
of all items attempted or all items correctly answered may be averaged to
provide a score based on more information (Larkin & Weiss, 1974). Lord
(1970, 1971b) has recommended an averaging method which excludes the first
item (since all testees attempt it) and includes the (n+1)th item.
Hansen (1969) has proposed a more complex scoring method which assigns an
estimated score to each item in the pyramid, whether or not it is attempted.

Weiss (1974) compares pyramidal tests with other strategies of
adaptive testing. The research literature on pyramidal adaptive tests has
been reviewed by Weiss and Betz (1973) and summarized by Larkin and Weiss
(1974).

Two-stage Tests 

A two-stage test consists of a preliminary or routing test followed by 
one of several measurement tests. Figure 2 illustrates a sample two-stage
structure. The purpose of the routing test is to provide an approximate 
estimate of the testee's ability level so that a measurement test of 
appropriate difficulty can be selected for each testee. The routing test 
can be composed of items with difficulties either peaked at the ability 
level of the group taking the test (as shown in Figure 2) or distributed 
throughout the range of ability under consideration (see Weiss, 1974, pp. 4-7). 
The measurement tests are usually peaked tests of differing levels of 
difficulty. The routing test is administered to the testee and his/her 
score is determined. A measurement test of appropriate difficulty is 
selected, based on the testee's score on the routing test. The measurement 
test is then administered and the testee's score is determined.

Variants of the two-stage routing procedure (see Weiss, 1974, p. 7) 
include double-routing and "sequential" procedures (Cleary, Linn, & Rock,
1968a,b; Linn, Rock, & Cleary, 1969). The former requires two routing
tests to be administered. A testee's score on a preliminary routing test
determines which of several intermediate routing tests is attempted.
Branching to an appropriate measurement test is based on the testee's
performance on the second routing test. The sequential procedure involves
computing likelihood ratios after each response to items in the routing 
test. Branching to a measurement test occurs when the likelihood ratio permits 
a classification of the individual. 

Most methods of scoring two-stage tests have used information from
both the routing and measurement subtests. Lord (1971c) and Betz and
Weiss (1973) have combined maximum likelihood ability estimates from the
routing and measurement subtests to determine an overall estimate of a
testee's ability. Linn, Rock, & Cleary (1969), on the other hand, did not
include a testee's performance on the routing test in some of their scoring
procedures.

Weiss (1974) compares two-stage tests with other strategies of adaptive 
testing, and discusses potential advantages and limitations of this approach. 
Research literature on two-stage adaptive testing has been reviewed by Weiss 
& Betz (1973) and Betz & Weiss (1973).

Research Comparing Two-stage and Pyramidal Tests 

The only study including both two-stage and pyramidal testing strategies
was reported by Linn, Rock & Cleary (1969). That study, using "real-data
simulation" methods, was based on the responses of a large group of testees
to a 190-item conventional paper-and-pencil test. The item responses were
then used to simulate a testee's responses to two-stage and pyramidal adaptive
testing strategies. Five different two-stage strategies were compared to
two pyramidal strategies.

The first two-stage test included a 20-item routing test with a rectangular
distribution over a "broad range" of item difficulties, and four 20-item
measurement tests. The second employed a double-routing procedure. A
testee's score on a 10-item routing test determined which of two second-stage
10-item routing tests was administered. Scores on the second routing test
branched the testee to one of four 20-item measurement tests. The third
two-stage procedure used a 20-item "group discrimination" routing test. Items
in that test were those which showed the largest differences in proportion
correct between groups divided into quartiles on total scores for the original
190 items. The routing test in the two final strategies involved computing
likelihood ratios after each item, and branching occurred when the likelihood
ratio permitted a classification of the individual into groups based on
scores derived from the parent 190 items. These methods were called "sequential"
procedures. Both a three-group and a four-group sequential approach were
used. Linn et al. used two methods to score their two-stage tests. One
used the information obtained from the routing test while the other did not.

Linn et al. studied two variations of the pyramidal strategy. The first
pyramidal test had ten stages with an entry point of p=.65, a step size of
.02 and an equal offset. Items were weighted according to difficulty, and
scores represented the sum of the weights of items attempted by each testee.
The second pyramidal strategy consisted of five stages with five items per
stage (see, e.g., Weiss, 1974, pp. 25-26). Branching occurred from block to
block. This pyramid was scored using a weighted scoring scheme similar to
that used for the single-item pyramid.

All seven adaptive tests were compared to five conventional subtests of
from 10 to 50 items selected from the same 190-item parent test. Scores on
the two-stage strategies correlated from .93 to .97 with scores on the
190-item parent test, while the shortened conventional tests had correlations of
from .89 to .96 with the full conventional test. The 25-item pyramid showed
a comparable correlation (.95), but the ten-item pyramid correlated only
.87 with the parent test.

Since all the items in the adaptive and shortened conventional tests
were also included in the longer parent test, and since the correlations with
the parent test increased as the length of the shorter tests increased, it
is possible that the degree of correlation obtained in this study could be
due partly to the number of items in common between the tests.

When achievement test criteria were obtained, scores on the ten-stage
pyramidal test correlated higher with the criterion measures than did scores
on conventional tests of the same length in seven of eight comparisons.
The 25-item pyramids correlated more highly with the criteria than the 50-item
conventional tests. With one exception, the two-stage tests also achieved
higher correlations with the criterion achievement tests than did conventional
tests of the same length. Under one scoring procedure the 40-item group
discrimination two-stage test was more highly correlated with outside
criteria, in four of eight comparisons, than even the 190-item parent
conventional test.

Linn et al.'s data permit a comparison of the relative validity of
their two-stage and pyramidal tests as predictors of the achievement test
criteria. Their data show that, with the exception of the sequential
two-stage strategy, two-stage tests had higher correlations with the criteria than
did the pyramidal tests. The ten-item pyramidal test had the lowest validities
of all of the adaptive tests, and the validities of the 25-item block branching
pyramid were about equal to those of the sequential two-stage test. Within
the two-stage tests, the group discrimination approach had slightly higher
validities than the other two-stage tests.

These comparisons of the relative validity of the two-stage and pyramidal 
strategies did not take account of the relative numbers of items in the 
different tests. While the two-stage tests were all composed of about 40 items, 
only 10 items were administered in one pyramidal test and 25 in the other. 
Linn et al. (pp. 142-143) estimated the lengths of conventional tests parallel to
the 190-item parent test which would be necessary to achieve the same validity
as each of the adaptive tests. When these values were compared to the actual
adaptive test lengths, an index of "relative saving in test length" was
obtained. The group discrimination and three-group sequential methods showed
the highest ratios, followed by the 25-item and 10-item pyramidal strategies.
The four-group sequential method showed the lowest ratios. 

Purpose 

Although Linn et al. (1969) used both pyramidal and two-stage tests in 
their study simulating adaptive testing, their major objective was to study 
the relationships between short adaptive and conventional tests and longer 
parent tests or achievement test criteria. The present investigation is one 
of a series of studies designed to further compare adaptive testing strategies 
using other criteria. These studies use actual computer administration of 
adaptive tests to groups of college students. The results of different adaptive 
testing strategies have been compared with those obtained from conventional 
testing approaches (Betz & Weiss, 1973; Larkin & Weiss, 1974) with respect to
the accuracy of ability estimation, test-retest stability, internal consistency
reliabilities, and other psychometric characteristics. In addition, more 
fundamental questions about each strategy are under consideration, including 
the investigation of various item difficulty structures for each of the 
adaptive strategies, problems in determining branching or routing rules, and 
the determination of meaningful and reliable scoring methods for each adaptive 
strategy. 

In this series of studies, all tests, both conventional and adaptive, 
were constructed for administration by computer (DeWitt & Weiss, 1974). 
Testing strategies were administered two at a time so that scores from one 
adaptive strategy could be compared with those from another, and so that scores 
from adaptive and conventional tests could be directly compared. In order to 
determine the stability of scores from each of a number of scoring methods, 
each testee was administered the same test on two occasions with periods 
averaging about six weeks between the initial and final testing. In some 
studies, conventional and adaptive strategies were paired on both test and 
retest and the comparative stabilities of the two strategies were studied. 
Other studies focused on comparisons among the various adaptive strategies.

The present analysis was undertaken for the purpose of directly comparing 
the psychometric characteristics of scores obtained from a two-stage strategy 
and the pyramidal approach. Previous studies in this series have reported the 
results of analyses of computer-administered two-stage (Betz & Weiss, 1973) and 
pyramidal tests (Larkin & Weiss, 1974) in comparison with conventional tests. 
However, a different group of subjects was used in each of those studies. In 
the present study, the characteristics of scores derived from two-stage and 
pyramidal tests are compared directly using the same group of subjects. 

METHOD 

One set of test data was derived from the administration of a two-stage
test and a pyramidal test to 111 subjects. 

The 15-stage pyramidal item structure was composed of 120 items. Each
testee completed only fifteen items. The two-stage test required 130 items
for its construction and each subject answered 40 items. Both tests drew 
items from the same item pool, and eighty items were common to both test 
structures. Although each testee could be administered a maximum of 15 items 
common to both tests, it was also possible that a testee could receive no 
common items. 

In order to detect the presence of the effects of boredom or fatigue, the 
order of presentation was randomized on both testings. Each adaptive test was 
administered first to half the testees and administered second to the remaining 
testees. 

Test Construction 

Item Pool

The item pool was composed of 369 five-alternative multiple-choice
vocabulary questions normed on college undergraduates (McBride & Weiss, 1974).
Using estimates of item difficulty (proportion correct) and item discrimination
(biserial correlation with total score on the norming tests), approximations to
the normal ogive item parameters a and b (Lord & Novick, 1968, pp. 376-379)
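were determined using the following formulas:

\[ a = \frac{r_b}{\sqrt{1 - r_b^{2}}} \qquad (1) \]

\[ b = \frac{-\Phi^{-1}(p)}{r_b} \qquad (2) \]

where $a$ is the normal ogive index of discrimination,
      $b$ is the normal ogive index of difficulty,
      $r_b$ is the biserial correlation of item response and total score,
      $p$ is the proportion correct,
and   $\Phi^{-1}$ is the inverse of the cumulative normal distribution
      corresponding to the proportion correct.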
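
As a hedged illustration of how such approximations might be computed, the sketch below assumes the conventional normal ogive conversions, a = r / sqrt(1 - r^2) and b = -Φ⁻¹(p) / r, where r is the biserial correlation and p the proportion correct. The exact form and sign convention used in the study are assumptions here, and the function name `normal_ogive_params` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def normal_ogive_params(p_correct: float, r_biserial: float):
    """Approximate normal ogive parameters from classical item statistics.

    Assumed conversions (not confirmed against the original report):
      a = r / sqrt(1 - r^2)
      b = -inverse_normal_cdf(p) / r
    so that easier items (larger p) receive lower difficulty values.
    """
    a = r_biserial / sqrt(1.0 - r_biserial ** 2)
    b = -NormalDist().inv_cdf(p_correct) / r_biserial
    return a, b

# Hypothetical item: half the norm group answers correctly, biserial = .60.
a, b = normal_ogive_params(p_correct=0.50, r_biserial=0.60)
print(round(a, 3), round(b, 3))
```

Under these assumptions an item of median difficulty (p=.50) has b=0 regardless of its discrimination, and a biserial of .60 corresponds to a=.75.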

Items with biserials lower than .30 were not used in the item pool. The norming
studies indicated that there was some difficulty-discrimination interaction
such that the pool contained disproportionately more highly discriminating
items in the lower range of item difficulty.

Construction of the Two-stage Test 

The two-stage test used in this study was composed of a 10-item routing
test and four 30-item measurement tests. This adaptive test was the
"Two-stage 2" test in a simulation study in a previous report in this series
(Betz & Weiss, 1974).

Routing test. In order to make a good initial assessment of ability and
to assign testees to measurement tests while minimizing the probability of an
assignment error, the 10 items in the routing test were selected to have a
high mean discrimination. As shown in Table 1, mean discrimination for the
routing test was a=.702. The standard deviation of the item discriminations
was .163. Appendix A, which shows difficulty and discrimination values for
each item in the routing subtest, indicates that the lowest discrimination
was a=.50 and the highest was a=.98.

The routing subtest was a peaked test of median difficulty items which
were highly discriminating. The items in the routing subtest had a mean
difficulty level of b=-.232. Table 1 shows that the standard deviation of the
item difficulties in the routing test (.050) was very low when compared to those
of the measurement tests.

After the routing test was completed, an estimate of the testee's ability 
was made in standard units (see Betz & Weiss, 1974, pp. 11-12). Subjects were 
assigned to the measurement test closest in difficulty to their estimated
ability. Thus, those testees with from 0 to 4 items correct on the routing test
were assigned to the least difficult of the measurement tests. Those with
scores of 5-6, 7-8, and 9-10 were routed to one of the three more difficult
measurement tests.
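
The routing rule can be written as a small lookup. The sketch below is illustrative only: the function name `route` is hypothetical, and the mapping of the score bands 5-6, 7-8, and 9-10 onto measurement tests 3, 2, and 1 is an assumption based on the ordering of mean difficulties in Table 1 (test 4 easiest, test 1 most difficult).

```python
def route(routing_score: int) -> int:
    """Return the measurement test number for a 10-item routing score.

    Assumes test 4 is the least difficult and test 1 the most difficult,
    following the mean difficulties reported in Table 1.
    """
    if not 0 <= routing_score <= 10:
        raise ValueError("routing score must be between 0 and 10")
    if routing_score <= 4:
        return 4  # least difficult measurement test
    if routing_score <= 6:
        return 3
    if routing_score <= 8:
        return 2
    return 1      # most difficult measurement test

assignments = [route(score) for score in range(11)]
print(assignments)
```

Note that the lowest band is wider than the others: routing scores of 0 through 4 all lead to the easiest measurement test, presumably because scores near the chance level carry little information about ability.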



Table 1

Means and Standard Deviations of Normal Ogive Item
Parameters for Two-stage and Pyramidal Tests

                        No.     Difficulty (b)    Discrimination (a)
Test                   Items    Mean     S.D.      Mean     S.D.

Two-stage
  (all items)           130    -.072    1.251      .633     .183
  Routing                10    -.232     .050      .702     .163
  Measurement 1          30    1.725     .558      .530     .126
  Measurement 2          30     .350     .297      .684     .214
  Measurement 3          30    -.709     .189      .611     .122
  Measurement 4          30   -1.603     .373      .683     .213
Pyramid                 120    -.094    1.256      .799     .457



Measurement tests. In selecting items for each of the four measurement
tests, the following rationale was used. The quantity a(b_m - b_r) was computed,
where b_r is the mean difficulty of the routing test. The a parameter is the
mean discrimination for all 130 items in the two-stage structure, i.e., .633;
b_m represents the mean difficulty of the measurement test in question. Betz
and Weiss (1974) have shown that to obtain four measurement tests suitable for
subjects routed to them, the values required for a(b_m - b_r) were 1.239, .368,
-.302 and -.868.
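
The quantity described above can be recomputed from the Table 1 means as a quick consistency check. This is a sketch, with hypothetical variable names; it takes the difference as measurement-test mean difficulty minus routing-test mean difficulty, scaled by the mean discrimination of all 130 items.

```python
# Values taken from Table 1.
A_MEAN = 0.633          # mean discrimination, all 130 two-stage items
B_ROUTING = -0.232      # mean difficulty of the routing test
MEASUREMENT_MEANS = {1: 1.725, 2: 0.350, 3: -0.709, 4: -1.603}

# a * (b_measurement - b_routing) for each measurement test
offsets = {
    test: round(A_MEAN * (b_mean - B_ROUTING), 3)
    for test, b_mean in MEASUREMENT_MEANS.items()
}
print(offsets)
```

With the Table 1 values this gives 1.239, .368, -.302, and -.868 for measurement tests 1 through 4.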

Table 1 shows that the average of the discrimination parameters for the
measurement tests ranged from .530 to .684, and that the average variability
of discrimination values for the measurement tests was about the same as
the variability of item discriminations in the routing test. The measurement
tests were not as peaked as the routing test, as indicated by the larger
ranges and standard deviations of their difficulties. The average difficulties
of the measurement tests, as shown in Table 1, approximated the desired values,
but measurement test 1 was somewhat more difficult than the target value,
and measurement tests 3 and 4 were somewhat easier. Appendix A gives the
normal ogive item parameters for the items in each measurement test.

Scoring. The two-stage test was scored by the same method used by Betz
and Weiss (1973, 1974) who adapted their method from studies by Lord (1971c). 
Essentially, maximum likelihood estimates of ability were obtained from both 
subtests and then weighted and summed. The measurement test was given three 
times the weight of the routing test because there were three times as many 
items in it as in the routing test. 

The formula used to obtain the ability estimates for both subtests
completed by each testee was:

\[ \hat{\theta} = \frac{1}{a}\,\Phi^{-1}\!\left[\frac{x/m - c}{1 - c}\right] + b \qquad (3) \]

where $a$ is the mean discrimination of the subtest,
      $x$ is the number correct,
      $m$ is the number of items in the subtest,
      $c$ is the chance score level,
      $b$ is the mean difficulty of items in the subtest,
and   $\Phi^{-1}$ is the inverse of the cumulative normal distribution
      function corresponding to the proportion correct.

For perfect scores ($x = m$), $\hat{\theta}$ could not be determined. Therefore,
when $x$ was equal to $m$, it was replaced by $x = m - .5$. For scores at or below
chance ($x \le cm$), $\hat{\theta}$ was also indeterminate and $x$ was replaced by
$x = cm + .5$.

The scores of the subtests were combined in the following way:

\[ \hat{\theta} = \frac{\hat{\theta}_r + 3\,\hat{\theta}_m}{4} \qquad (4) \]

where $\hat{\theta}$ is the combined ability estimate,
      $\hat{\theta}_r$ is the ability estimate obtained from the routing test,
and   $\hat{\theta}_m$ is the ability estimate obtained from the measurement test.
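
The two scoring steps can be sketched in Python as follows. This is an illustration under assumptions rather than the original scoring program: it takes the subtest estimate to be (1/a) Φ⁻¹[(x/m - c)/(1 - c)] + b, takes the chance level as c = .2 (five-alternative items), and weights the measurement test three times as heavily as the routing test. The function names are hypothetical.

```python
from statistics import NormalDist

def subtest_theta(x: float, m: int, a: float, b: float, c: float = 0.2) -> float:
    """Ability estimate for one subtest (assumed form of Formula 3).

    x: number correct, m: number of items, a: mean discrimination,
    b: mean difficulty, c: chance score level (1/5 for five alternatives).
    """
    if x >= m:             # perfect score: estimate is otherwise undefined
        x = m - 0.5
    elif x <= c * m:       # at or below chance: estimate is otherwise undefined
        x = c * m + 0.5
    corrected = (x / m - c) / (1.0 - c)   # proportion correct, adjusted for chance
    return NormalDist().inv_cdf(corrected) / a + b

def combined_theta(theta_routing: float, theta_measurement: float) -> float:
    """Weighted combination (assumed form of Formula 4): 1 part routing, 3 parts measurement."""
    return (theta_routing + 3.0 * theta_measurement) / 4.0

# Perfect scores on the routing test and on measurement test 1 (Table 1 values).
theta_r = subtest_theta(x=10, m=10, a=0.702, b=-0.232)
theta_m = subtest_theta(x=30, m=30, a=0.530, b=1.725)
print(f"{combined_theta(theta_r, theta_m):.2f}")
```

With perfect scores on both subtests, this sketch yields a combined estimate of about 4.66, which agrees with the upper end of the two-stage score range reported in the Analysis section.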

This combined ability estimate can be interpreted as a standard normal deviate
(see Betz & Weiss, 1973, pp. 14-15).

Construction of the Pyramidal Test

The pyramidal test used in this study was Pyramid 3, studied by Larkin &
Weiss (1974). It was composed of fifteen stages with a constant step size.
An up-one/down-one branching rule was used. Since n(n+1)/2 items are needed
for the construction of an n-stage pyramid, 15(15+1)/2 or 120 items were
selected from the item pool. The initial item was of median difficulty for
the testees of the norm group. The step size, that is, the increment or
decrement in item difficulty from one stage to the next, had a mean value of
b=.199, and a standard deviation of .08.

After establishing the initial item difficulty and step size, the available
items in the pool were divided into 29 groups on the basis of difficulty.
All items in a group had about the same b value and an a value of at least
.30. The items required were selected from each group according to their
discriminations. The items with the highest discriminations in each group
were selected for use in the pyramidal test. Paterson (1962) has suggested
that items in a pyramidal test be ordered within each column according to
discrimination with the most discriminating items appearing first. This
suggestion was followed in construction of this pyramidal test, as shown in
Appendix B which gives the normal ogive difficulty and discrimination estimates
for each item in the pyramidal test. The item difficulties ranged from
b=-2.86 to b=2.61. The discrimination values varied from a=.41 to a=3.00.

Appendix B indicates that the initial item, which was presented to all
testees, had a difficulty of b=-.05. If the subject answered this item
correctly, he/she was branched to a more difficult item (b=.14) at stage 2.
An incorrect response branched the testee to an item easier (b=-.21) than the
initial item. The branching process continued until each testee had attempted
15 items.
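
The size and shape of such a structure can be checked with a short sketch. It idealizes the pyramid by assuming a perfectly constant step of .199 and a starting difficulty of -.05; the actual items only approximate these targets (the reported step size has a standard deviation of .08). The function names are hypothetical.

```python
def pyramid_offsets(n_stages: int):
    """Offsets from the starting difficulty, in step units, for every item.

    Stage s has s items. After s - 1 responses with k correct, the testee is
    k steps up and (s - 1 - k) steps down, i.e. at offset 2k - (s - 1).
    """
    return [[2 * k - (stage - 1) for k in range(stage)]
            for stage in range(1, n_stages + 1)]

def pyramid_difficulties(n_stages: int, b_start: float, step: float):
    """Idealized target difficulties under a constant, equal-offset step."""
    return [[b_start + k * step for k in row] for row in pyramid_offsets(n_stages)]

offsets = pyramid_offsets(15)
n_items = sum(len(row) for row in offsets)               # n(n+1)/2
n_levels = len({k for row in offsets for k in row})      # distinct difficulty levels
grid = pyramid_difficulties(15, b_start=-0.05, step=0.199)

print(n_items, n_levels)
print([round(b, 3) for b in grid[1]])   # idealized second-stage targets
```

Under these assumptions the 15-stage structure contains 120 items at 29 distinct difficulty levels, which is consistent with the 29 difficulty groups formed from the item pool. The idealized second-stage targets (-.249 and .149) are close to, but not identical with, the actual item difficulties of -.21 and .14.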

The means and standard deviations for difficulty and discrimination are 
shown in Table 1. The average difficulty of the items in the pyramidal 
structure was b=-.094, with a standard deviation of 1.256. The average 
discrimination of the pyramid items was a=.799. When all items in each 
adaptive test were considered, Table 1 shows that the overall difficulties
were almost the same. The 120 items in the pyramidal structure and the 130 
items in the two-stage test had very similar means and standard deviations of 
item difficulties. However, the pyramid was composed of more highly dis- 
criminating items and the variance of the item discriminations was much 
higher in the pyramidal test. 

Scoring. In order to compare ability estimates derived from various
scoring methods, four different methods were used to estimate ability. These
four methods were among those used in a previous investigation of pyramidal
testing (Larkin & Weiss, 1974). Method 1 was the number of correct responses.
This has been the most common scoring method used in other studies. For a
pyramid of 15 stages, 16 different number correct scores are possible (0 to
15). Method 2 was the mean difficulty of the items attempted by each testee.
An approach similar to this involves averaging the difficulties of all items
but the first (since every testee attempts it) and including a hypothetical
sixteenth item (Lord, 1970, 1971b). Method 3 averages the difficulties of
the correctly answered items only. Under method 4, subjects were scored by
the difficulty of the final item attempted in the pyramid; since the branching
strategy actually adapts the difficulties of the items to the ability of
the testee, the difficulty of the final item reached should reflect the
testee's ability level (assuming that the pyramidal structure has enough
stages). Two other scoring methods, the (n+1)th difficulty score and the
all-item score (Hansen, 1969), were found in previous research (Larkin & Weiss,
1974) to correlate perfectly with the number correct score and mean
difficulty of all items attempted, respectively. Consequently, these two scoring
methods were not used in the present analyses.
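
The four scoring methods can be illustrated with a short sketch. The function name `pyramid_scores`, the dictionary keys, and the example response record are hypothetical; the difficulties in the example are made-up values, not items from Appendix B.

```python
from statistics import mean
from typing import Dict, List, Optional

def pyramid_scores(difficulties: List[float],
                   correct: List[bool]) -> Dict[str, Optional[float]]:
    """Score one testee's path through the pyramid by the four methods.

    difficulties[i] is the b value of the i-th item attempted and
    correct[i] is True when that item was answered correctly.
    """
    correct_bs = [b for b, ok in zip(difficulties, correct) if ok]
    return {
        "number_correct": sum(correct),                          # Method 1
        "mean_difficulty_attempted": mean(difficulties),         # Method 2
        # Method 3 is undefined when no item is answered correctly.
        "mean_difficulty_correct": mean(correct_bs) if correct_bs else None,
        "final_item_difficulty": difficulties[-1],               # Method 4
    }

# Hypothetical four-item path: right, right, wrong, right.
scores = pyramid_scores([-0.05, 0.14, 0.35, 0.14], [True, True, False, True])
print(scores)
```

For this made-up record the four scores are 3 correct, a mean attempted difficulty of .145, a mean correct difficulty of about .077, and a final item difficulty of .14. The `None` returned for a testee with no correct responses reflects the fact that a mean over zero items cannot be computed.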

Test Administration and Subjects 

Cathode ray terminals (CRT's) acoustically coupled to a time-shared
computer system were used to administer both the two-stage and pyramidal
test (DeWitt & Weiss, 1974). Items were presented one at a time on the CRT
screen; subjects responded by typing a number corresponding to the correct
alternative to each multiple-choice item. A total of 55 items (15 from the 
pyramidal test and 40 from the two-stage test) was administered to each 
testee. The order of presentation of the tests was randomized over subjects. 
Fifty-six testees completed the pyramidal test first and the 55 remaining 
testees completed the two-stage test first. Subjects were informed at the 
completion of testing of the total number of items they answered correctly. 

The testees were undergraduates enrolled in general psychology or 
psychological statistics courses at the University of Minnesota. Because this 
combination of adaptive tests was given as the second session of a two-part 
study, all had had previous experience with computer-administered tests. All
subjects were given the opportunity to review instructions explaining the 
operation of the CRT's prior to testing. A proctor was available in the 
testing room to begin the testing and to provide further assistance to any 
testee having difficulty with the equipment. No time limit was imposed.
Testees were informed that they might take as much time as necessary to finish
the tests. 

Analysis 

The data analyzed in the present study consisted of five scores, one 
two-stage score and four pyramidal scores, for each testee. 

Order Effects 

The effects of the order of administration on test scores were investigated 
by comparing scores of the testees who received each strategy first with those 
who received that strategy second in the series of two tests. In this manner 
fatigue, practice, or carry-over effects between strategies could be detected. 
Because the scores were expected to be highly correlated a one-way multivariate 
analysis of variance was used with all five scores simultaneously considered 
as dependent variables. 

Characteristics of Score Distribution 

One objective of the present study was to compare the distributions of 
scores on the 40-item two-stage test with those obtained from each method of 
scoring the 15-stage pyramidal test. The appropriateness of the test difficulty, 
the relative variabilities of each scoring procedure, and the shape of the
obtained score distributions were examined.

Because different units were used in scoring the tests, the standard
deviation of each scoring method was divided by the potential range of scores
under that method. The resulting value is an index of relative variability
(Betz & Weiss, 1973). This index shows the effective utilization of the entire
score range for each scoring method. The range of possible scores on the
two-stage test was derived using Formulas 3 and 4 to compute estimates of $\theta$
for perfect and chance scores. This range was 4.66-(-5.30)=9.96. The ranges
for the pyramidal scoring methods were as follows: 1) The number correct
range was 15; 2) The range for the mean-difficulty-attempted score was the
difference between the score made by a testee answering all items correctly
and the score of one responding incorrectly to all items. This value was
2.79; 3) The range for the mean-difficulty-correct score was the difference
between the score of a subject with 15 correct responses and the lowest
(n+1)th score. The latter value was used since a testee with no items answered
correctly would have a mean-difficulty-correct score which was undefined.
This range was 4.42; 4) The final item difficulty range was the difference
between the easiest and most difficult terminal items, or 5.48.
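
The index of relative variability is simply the observed standard deviation divided by the potential score range. A minimal sketch follows; the score ranges are those given in the text, while the function name, dictionary keys, and the standard deviation used in the example are hypothetical.

```python
def relative_variability(sd: float, score_range: float) -> float:
    """Standard deviation expressed as a fraction of the potential score range."""
    return sd / score_range

# Potential score ranges reported in the text for each scoring method.
SCORE_RANGES = {
    "two_stage": 4.66 - (-5.30),
    "number_correct": 15,
    "mean_difficulty_attempted": 2.79,
    "mean_difficulty_correct": 4.42,
    "final_item_difficulty": 5.48,
}

# Hypothetical example: a number-correct standard deviation of 3.0.
index = relative_variability(3.0, SCORE_RANGES["number_correct"])
print(round(SCORE_RANGES["two_stage"], 2), index)
```

Dividing by the range puts scores expressed in different units on a common footing, so that a larger index indicates that a scoring method spreads testees over more of its available score scale.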

In addition to the mean and variability indices, the skew and kurtosis
of each distribution were computed and the significance and direction of its
departure from normality were determined (McNemar, 1969, pp. 25-28, 87-88).

Relationships between Two-stage and Pyramidal Scores 

To determine the relationships among the pyramidal scores and their 
relationships to two-stage scores, product-moment correlations and correlation 
ratios (eta) were computed. The latter were computed to determine whether the 
relationships between scores on the two strategies were curvilinear. In 
determining the etas, both the regression of two-stage scores on pyramidal 
scores and the regression of pyramidal scores on two-stage scores were 
computed.
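
The correlation ratio can be sketched as the square root of the between-group sum of squares divided by the total sum of squares, where groups are formed from the predictor scores. The sketch below is illustrative only: the function name is hypothetical, the data are made up, and continuous predictor scores would first have to be grouped into class intervals, a step not shown here.

```python
from collections import defaultdict
from math import sqrt
from typing import Hashable, Sequence

def correlation_ratio(groups: Sequence[Hashable], values: Sequence[float]) -> float:
    """Eta for predicting `values` from membership in `groups`."""
    by_group = defaultdict(list)
    for g, v in zip(groups, values):
        by_group[g].append(v)
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2
                     for vs in by_group.values())
    return sqrt(ss_between / ss_total)

# Made-up data with an inverted-U relationship between x and y.
x = [1, 1, 2, 2, 3, 3]
y = [1, 3, 5, 7, 3, 1]
eta = correlation_ratio(x, y)
print(round(eta, 3))
```

In this made-up example the product-moment correlation between x and y is zero while eta is about .88, which is the kind of discrepancy that signals a curvilinear relationship. Because eta depends on which variable is treated as the predictor, both regressions must be examined separately.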

Internal Consistency Reliability 

Data on the reliabilities of the two-stage and pyramidal tests are
important to provide a point of reference for interpreting the correlations between
scores on the two adaptive strategies.

The internal consistency reliability of the two-stage test was determined 
by Hoyt's (1941) method. This index can be computed only if every subject
attempts each item on a test. For this reason, the two-stage test had to
be treated as five separate tests. Reliabilities were computed separately
for the routing test, using the responses of the total group of subjects, and
for each of the four measurement tests, using the responses of those subjects
routed to each measurement subtest. To compare the internal consistency of
the 10-item routing test with that of the 30-item measurement tests, the
Spearman-Brown formula was used to estimate the reliability of a 30-item
routing subtest based on the testees' responses to 10 items.
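The length correction described above can be sketched with the standard Spearman-Brown prophecy formula. The function name is illustrative only; the input of .72 is the routing-test reliability reported in Table 6, and tripling the length (10 to 30 items) gives the estimate of about .89 reported there.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when a test is lengthened by `length_factor`."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


# Routing test: 10 items with reliability .72, projected to 30 items.
routing_30 = spearman_brown(0.72, 30 / 10)
print(round(routing_30, 2))  # 0.89
```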

Because all testees do not answer the same subset of items under the 
pyramidal strategy, its internal consistency reliability cannot be determined 
satisfactorily (see Larkin & Weiss, 1974). Consequently, to make meaningful 
comparisons between the reliabilities of the pyramidal and two-stage tests, 
the test-retest correlations for each strategy determined from two previous 
empirical studies (Betz & Weiss, 1973; Larkin & Weiss, 1974) were used.

Mis-routing

Mis-routing occurs in the two-stage strategy when a testee is routed to
measurement tests of inappropriate difficulty. The following criteria were
used (see Betz & Weiss, 1973) to determine the proportion of testees who 
were mis-routed. All testees who obtained perfect scores (30) on their 
measurement subtest were considered to have been routed to a test too easy 
for them. Those testees with subtest scores at or below chance level
(i.e., 6 correct) were considered to have been assigned to a measurement test 
too difficult for them. If a testee met either of the two criteria, he/she 
was classified as having been mis-routed by the routing test. 
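The two criteria can be expressed as a small classification rule. This is a minimal sketch: the function names and the example score list are hypothetical, while the thresholds (30 for a perfect score, 6 for chance level) are those stated above.

```python
def classify_routing(measurement_score, n_items=30, chance_score=6):
    """Label a testee's routing from the measurement subtest score."""
    if measurement_score == n_items:
        return "too easy"        # perfect score: subtest was too easy
    if measurement_score <= chance_score:
        return "too difficult"   # at or below chance: subtest was too hard
    return "appropriate"


def misrouting_rate(scores):
    """Proportion of testees meeting either mis-routing criterion."""
    flagged = sum(classify_routing(s) != "appropriate" for s in scores)
    return flagged / len(scores)


# Hypothetical measurement-subtest scores for six testees.
example_scores = [17, 30, 6, 22, 5, 19]
print([classify_routing(s) for s in example_scores])
print(misrouting_rate(example_scores))  # 0.5
```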

Intercorrelations of Pyramidal Scores



Product-moment correlations were computed for all pairs of pyramidal 
scoring methods to determine the interrelationships among them. Correlation 
ratios were computed and compared with the product-moment correlations to
detect the presence of possible curvilinear relationships.



RESULTS 



Order Effects 

Table 2 shows the means and standard deviations by scoring method and 
strategy for the groups completing pyramidal or two-stage tests first.
The one-way multivariate analysis of variance resulted in an F-value of .92 
with an associated probability of .47. Thus the two sets of mean scores 
obtained under the two orders of administration were not significantly 
different. As a result, the data from both order groups were combined for 
all further analyses.

Table 2

Means and Standard Deviations for Subgroups
Completing Pyramidal and Two-Stage Tests in
Different Orders

                                Pyramid First        Two-stage First
                                   (N=56)                (N=55)
Test and Scoring Method         Mean     S.D.         Mean     S.D.

Pyramidal Test
  Number correct                8.21     2.55         7.64     2.24
  Mean difficulty-attempted     0.10     0.56        -0.09     0.53
  Mean difficulty-correct      -0.02     0.61        -0.22     0.57
  Difficulty of final item      0.17     0.97        -0.06     0.88

Two-stage Test                 -0.16     1.39        -0.50     1.19

Score Distributions

Pyramidal test. Descriptive statistics for the pyramidal and two-stage
test scores are presented in Table 3. The mean number correct score of 7.93
indicated that the subject group as a whole answered approximately half the
15 items in the pyramid correctly, suggesting that the difficulty of the test
was appropriate for the ability of the group tested. The two mean difficulty
scoring methods and the final item difficulty scoring method all had means
of about 0.0. Since the test was composed of items with mean difficulty
of -.094, this result was expected. These results also suggest that there
were few items answered correctly as a result of guessing, on the average,
since guessing would have resulted in scores above the average of the norming
group.



Table 3

Descriptive Statistics for Distributions of Scores
from Pyramidal and Two-stage Tests
(N = 111)

                                                       Proportion
Test and                                               of Range
Scoring Method               Mean   Median    S.D.     Utilized     Skew    Kurtosis

Pyramidal Test
  Number correct             7.93    7.43     2.41       .16        0.58*     0.08
  Mean difficulty-attempted  0.01   -0.12     0.55       .20        0.42     -0.47
  Mean difficulty-correct   -0.12   -0.23     0.60       .14        0.03      0.19
  Difficulty of final item   0.06   -0.08     0.93       .17        0.44     -0.20

Two-stage Test              -0.33   -0.54     1.30       .13        0.35      0.29

*Statistically significant at p<.05.



The variabilities for each scoring method are also shown in Table 3. 
The final item difficulty score had a standard deviation of about 1.0, again 
reflecting the characteristics of the standardized b-values. Because of the 
restriction in the range of possible score values resulting from the use of 
averages, the two mean difficulty scores, also computed from b-values, had 
standard deviations only about half as large as the final item difficulty 
scoring method. When variability is expressed as a proportion of each method's 
potential range, as shown in Table 3, the scoring methods are more easily 
compared. The mean-difficulty-correct scoring method utilized the smallest 
proportion of its available range (.14), while the mean-difficulty-attempted 
method used the largest proportion of its range (.20). The number correct 
score and the final item difficulty scores both utilized about the same 
proportion of their range (.16 and .17). 
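The index of relative variability can be checked directly from the reported figures. In the sketch below, the standard deviations come from Table 3 and the potential ranges from the method description; the dictionary layout and variable names are illustrative only.

```python
# (standard deviation, potential score range) for each scoring method
methods = {
    "number correct": (2.41, 15.0),
    "mean difficulty-attempted": (0.55, 2.79),
    "mean difficulty-correct": (0.60, 4.42),
    "difficulty of final item": (0.93, 5.48),
    "two-stage": (1.30, 4.66 - (-5.30)),  # range of 9.96
}

# Proportion of range utilized = standard deviation / potential range
proportions = {name: round(sd / rng, 2) for name, (sd, rng) in methods.items()}

for name, value in proportions.items():
    print(f"{name:28s} {value:.2f}")
```

The rounded values (.16, .20, .14, .17, and .13) match the "Proportion of Range Utilized" column of Table 3.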

All scoring methods had distributions which were slightly positively
skewed. The distribution of number correct scores was the most highly skewed,
and its skewness was significantly different from zero skew. The mean difficulty
of all items correctly answered showed almost no skew.

Two score distributions (mean-difficulty-attempted and difficulty of
final item) were platykurtic, although not significantly so. The number correct
and mean-difficulty-correct distributions were slightly leptokurtic. The
flattest distribution was that of the mean-difficulty-attempted scoring
method. When both skewness and kurtosis are considered, the
mean-difficulty-correct scores showed least departure from a normal distribution.

Two-stage test. The two-stage test scores, expressed in standard units,
had a mean of -0.33 and a standard deviation of 1.30. This mean was slightly
lower than that observed in the standardized pyramidal scores. The two-stage
test utilized a smaller proportion of its possible range (.13) than any method
of scoring the pyramidal test.

The distribution of two-stage scores was slightly positively skewed and
was slightly leptokurtic, although in neither case was it significantly
different from a normal distribution. The skewness was comparable to that of
most methods of scoring the pyramidal test, but the kurtosis indicated that
the two-stage score distribution was more peaked than those of the pyramidal
test.

Table 4 summarizes the performance of the total group of testees on the
10-item routing test.



Table 4

Means and Standard Deviations of Scores on Subtests
of the Two-Stage Test

                                            Subtest                        Composite
                               Routing Test      Measurement Test      Two-stage Score
                             (Number Correct)    (Number Correct)      (Standard Score)
Subject Group             N    Mean    S.D.        Mean    S.D.          Mean    S.D.

All Subjects             111   5.58    2.61       18.56    5.04         -0.33    1.30
Assigned to
  Measurement Test 1      21   9.33    0.48       17.00    6.01          1.48    0.97
Assigned to
  Measurement Test 2      20   7.40    0.50       17.20    4.43          0.22    0.65
Assigned to
  Measurement Test 3      27   5.63    0.49       18.59    4.87         -0.54    0.70
Assigned to
  Measurement Test 4      43   2.86    1.15       19.93    4.66         -1.33    0.81

Also shown are descriptive statistics of scores for the testees assigned to
each measurement subtest. On the routing test, the mean number of items 
correct over all subjects was 5.58 out of 10 items, suggesting that the routing 
test was peaked at a difficulty appropriate for the group taking the test. 
That is, item difficulties for the group tested averaged about .60 which is 
the expected median difficulty after chance has been taken into account. The 
standard deviation of number correct scores was relatively large (2.61),
indicating that the routing test was effective in making an initial separation 
of testees according to ability. The mean number correct across all four 
measurement tests (18.56) showed that after testees had been routed into the 
measurement test, they answered slightly more than half the measurement test 
items correctly. For each 30-item measurement test considered separately, 
the mean number correct varied from 17.00 to 19.93 (or between 57 and 66
percent correct). These findings imply that the measurement tests were also 
of appropriate difficulty for the groups of testees routed to them. These 
results, however, suggest that there were somewhat more successes due to 
guessing in the two-stage test than in the pyramidal test. 



The variability of scores for each of the four subject groups was rela- 
tively constant in three of the measurement tests; measurement test 1 had a 
slightly larger variability of scores than the other measurement tests. The 
variability in routing test scores for those subjects assigned to the least 
difficult measurement test (1.15) was larger than that for the other groups
due solely to the specifications of the routing procedure (i.e., a larger 
range of routing scores led to the assignment of testees to measurement test 
4, the least difficult measurement test). 



Relationship between Two-stage and Pyramidal Scores 

Eighty items were common to both the pyramidal and two-stage item pools. 
The number of times a testee was administered the same item twice (once under
each strategy) ranged from 0 to 13 with a mean of 6.02 and a standard deviation 
of 3.51. The correlations between the two tests are thus likely to be some- 
what inflated due to the tendency of subjects to make the same responses to 
an item in both the two-stage and pyramidal test, and should be interpreted
with caution. 

Table 5 shows the results of the regression analysis of the relationship 
between scores on the two-stage test and scores on the pyramidal test. 
Product-moment correlations ranged from .79 for the mean-difficulty-correct 
scoring method to .84 for the number correct scoring method. Correlation
ratios ranged from .83 to .88. There was no general tendency toward 
curvilinear relationships. In only one of the regressions was curvilinearity 
significant at the .05 level. Thus, the relationship between scores on the
two-stage and pyramidal tests is high and primarily linear. 






Table 5

Regression Analysis of Relationship between
Two-stage and Pyramidal Scores
(N=111)

                                        Regression of         Regression of
                                        Two-stage Score       Pyramid Score on
                                        on Pyramid Score      Two-stage Score
Scoring Method                   r       eta      p^a          eta      p^a

Number correct                  .84      .85      .71          .88      .25
Mean difficulty-attempted       .81      .86      .10          .86      .04*
Mean difficulty-correct         .79      .84      .23          .84      .15
Difficulty of final item        .83      .83      .56          .86      .21

^a Significance of curvilinearity
*Significant at p<.05

Internal Consistency Reliability

Table 6 shows the internal consistency reliabilities for the two-stage
subtests. The internal consistency of the 10-item routing test (.72) was the
same as that of the least difficult 30-item measurement test. When number
of items was equated for the 10-item routing test and the 30-item measurement
tests, the routing test showed the highest internal consistency of the five
subtests. This was likely due to the intentional restriction in the range of
abilities of subjects assigned to each measurement test by the routing process.

Table 6

Internal Consistency Reliabilities for Subtests of the Two-stage Test

                           Number      Hoyt Reliability
Subtest             N      of Items    Coefficient

Routing            111        10       .72 (.89^a)
Measurement 1       21        30       .84
Measurement 2       20        30       .66
Measurement 3       27        30       .75
Measurement 4       43        30       .72

^a Estimated reliability for a 30-item test.

While this finding might have resulted from differences in item discriminations
among the subtests, comparison of the data in Table 1 with those in Table 6
shows that the measurement tests with the highest average discriminations had
the lowest reliabilities. Measurement test 1 (the most difficult measurement 
test) did, however, have a reliability which was almost as high as the 
corrected reliability of the routing test. 

Mis-routing 

In the two-stage test, only one testee in the sample of 111 obtained a
score of 6 or less on the measurement subtest and was thus considered 
mis-routed. A less difficult measurement test would have been more appropriate 
for him/her. No perfect scores were obtained on any measurement subtest. 
The misclassification rate was therefore 1/111 = .009.

Intercorrelations of Pyramidal Scores 

The intercorrelations of scores from the four methods of scoring the 
pyramidal test are shown in Table 7. The highest observed correlation (r=.99)
was between the two mean difficulty scores. Number correct had the lowest
correlations (r=.93) with the two mean difficulty scores. There was no
curvilinearity in these data since all the corresponding r's and etas were
virtually identical. 



Table 7

Intercorrelations of Scores from
Pyramidal Scoring Methods
(N=111)

Scoring                                Number     Mean difficulty-   Mean difficulty-
Method                                 Correct    attempted          correct

Mean difficulty-attempted      r        .93
                               eta      .93

Mean difficulty-correct        r        .93          .99
                               eta      .93          .99

Difficulty of final item       r        .98          .95                .95
                               eta      .98          .95                .96

DISCUSSION AND CONCLUSIONS

Score distributions for both the pyramidal and two-stage tests suggested
that both were of appropriate difficulty for the general ability level of
the testees. For the pyramidal test, the mean score was slightly more than
half of the possible range. Those pyramidal test scores which were expressed
in standard units had means which were all about zero. The two-stage scores
also had a near-zero mean. These results were similar to those obtained by
Larkin and Weiss (1974) and Betz and Weiss (1973). However, the latter study
found mean scores for a similar two-stage test to be slightly closer to 
zero (-0.21 at time 1 and -0.02 at time 2) than in the present study (-0.33). 
In the previous investigation of two-stage tests, standard deviations were 
found to be 1.36 and 1.39. In the present study, the standard deviation was 
1.30. 

A "real data" simulation of the same two-stage test used in the present 
investigation (Betz & Weiss, 1974) resulted in a mean score of very near zero
(-.004) and a standard deviation of 1.05. Thus, real testees obtained a 
lower average score, and were more variable on the two-stage test, than were 
simulated testees. These results suggest that there are very few chance
successes due to guessing in actual administration of two-stage tests, since 
guessing would result in scores above zero, on the average. 

The two-stage test was found to utilize a smaller proportion of its
possible score range (.13) than the pyramidal test. This finding is consistent 
with the results of the two previous empirical studies in this series, in 
which two-stage tests and two methods of scoring pyramidal tests used a greater 
proportion of the score range than conventional tests. Betz and Weiss (1973) 
found that, for a similar two-stage test, the proportion of range utilized was 
.23. However, their index was computed by dividing the obtained standard 
deviation by 6 (±3 s.d.) rather than the actual possible range of two-stage
scores, thus inflating the index. The range of possible scores for the two- 
stage test in the present study was 9.96 rather than simply 6. Therefore, the 
proportion of range utilized is lower in the present study because of the 
change in the method of computation. 

Both adaptive tests provided score distributions which were slightly skewed 
in a positive direction, but, with the exception of one scoring method for 
the pyramidal test, the degree of skew was not statistically significant. 
Seeley, Morton and Anderson (1962) obtained a highly negatively skewed 
distribution of scores on a pyramidal test. Their result, however, was possibly
due to the easiness of their test and/or to the exclusion of some lower-ability
examinees who did not carefully follow the instructions. Bayroff and Seeley's
(1967) results, however, were more similar to those found in the present 
study; they obtained a normal distribution of pyramidal scores when computer 
administration was employed. Larkin and Weiss (1974) found a tendency toward 
positive skew in two other pyramidal tests similar to the one used here. 

In their previous empirical study of two-stage testing, Betz and Weiss 
(1973) obtained score distributions which also tended toward positive skew but 
were not significantly different from a normal distribution. The two-stage 
simulation (Betz & Weiss, 1974) showed score distributions to have almost 
zero skew (-.04) when administered to a population distributed normally on 
ability.

There was a slight, but non-significant, trend for most pyramidal score
distributions to be platykurtic. The tendency toward flatness in score
distributions from pyramidal tests has been noted by Hansen (1969) who obtained
a rectangular score distribution. Two similar pyramidal tests of Larkin and
Weiss (1974) were significantly flat. The two-stage score distribution in
the present study was slightly (and non-significantly) leptokurtic. Betz and
Weiss (1973), however, found that a similar two-stage test produced a slightly
flattened distribution of scores. With simulated data, Betz and Weiss (1974)
found that score distributions on the same two-stage test used in the present
study were significantly flat (p<.01), but less platykurtic than distributions
of scores for another two-stage test and a conventional test. Results of the
present study, however, showed that the mean-difficulty-correct score derived
from the pyramidal test gave results which were least deviant from a normal
distribution in comparison to other pyramidal and two-stage test scores.

The distributions of scores within the two-stage measurement subtests
represented an improvement over those obtained in the previous study of
two-stage testing. First, the number of testees assigned to each measurement
test was more nearly equal. Betz and Weiss (1973) found that approximately half
of the subjects completing their two-stage test were routed to the most
difficult measurement test. Further, the easier measurement tests in the previous
study were found to be too easy for the testees routed to them. The more even
distribution of testees routed to each measurement test in the present 
investigation can be attributed to the more appropriate difficulty of the 
routing test and to the revised procedure used to determine cutting scores for 
assignment to measurement tests. The improvement in the score distributions 
within the measurement tests is due to modifications making the more difficult 
measurement tests easier, and the less difficult measurement tests more 
difficult. 

The misclassification rate for the two-stage test in this study was .009
using the same criteria as those used by Betz and Weiss (1973), i.e., perfect
scores (30) or chance scores (or less) on the measurement tests. This compared
favorably with the 5% misclassification rate in Betz and Weiss (1973).
The 20% rates obtained by Angoff and Huddleston (1958) and by Cleary, et al.
(1969a,b; Linn, et al., 1969) were due primarily to the different
misclassification criteria in their real-data simulation studies. The low rate of
misclassifications in the present study may be accounted for by (1) the more
accurate assignment of subjects to measurement tests brought about by revisions 
in the routing tests, (2) the maximum likelihood procedure used for classifi- 
cation, (3) the increased cutting scores (no testee was routed to a measurement 
test in which he/she obtained a perfect score) and (4) the more appropriate 
difficulties of the items used in the measurement tests. 

The internal consistency reliabilities of the two-stage subtests also 
reflect the improvements in the difficulties of those subtests. For the 
routing test and three of the four measurement tests, measures of internal 
consistency were as much as .31 higher than the corresponding reliabilities 
found by Betz and Weiss (1973). This finding suggests that the difficulties 
of the measurement test items were more appropriate (i.e., approximating p=.5)
for the groups of subjects attempting them. The increased difficulty of the 
routing test items in the present study as compared to the previous empirical 
study resulted in routing test scores which had a standard deviation more 
than twice that found in the previous study. The changes made in the 
measurement tests, by decreasing the number of items which were much too
easy or too difficult for the group routed to them, enabled the interitem
correlations and thus the internal consistency reliability coefficient to 
increase. 

The correlations between scores on the pyramidal and two-stage tests 
obtained in this study ranged from r=.79 to .84 (eta=.83 to .88). The two
previous empirical studies in this series found correlations of r=.32 to .89
(eta=.84 to .92) between scores on the pyramidal and conventional testing
strategies, and r=.80 to .84 (eta=.82 to .88) between scores on the two-stage
and conventional tests. In the simulation study, Betz and Weiss (1974) found
a correlation of r=.82 (eta=.82) between scores on the two-stage and conventional
tests. Thus, it appears that scores on the two-stage test are almost as
highly related to scores on a 15-stage pyramidal test as they are to scores
on a 40-item conventional test. The relationship between scores on the two
adaptive tests is almost as high as that between the pyramidal and conventional
tests. In the two previous empirical studies, the items contained in the
adaptive and conventional tests were non-overlapping. The present study,
however, permitted some of the same items to be administered in both the
pyramidal and two-stage tests, which may have somewhat inflated the correlation
between them. An average of six items (40% of the pyramidal test's items)
were the same in both tests.

The correlation between scores on the two adaptive strategies approached 
their empirical stabilities. The seven-week test-retest stability of the
pyramidal test used in this study ranged from r=.82 to .86 (eta=.85 to .90)
depending on the scoring method used (Larkin & Weiss, 1974). The stability of
scores of a two-stage test similar to the one used here was r=.88 (Betz &
Weiss, 1973). Using the Pearson coefficients, the correlation between the
two-stage and pyramidal tests accounted for 62% to 71% of the common variance.
Stability of the adaptive tests showed that from 67% to 74% of the pyramidal
test's variance was reliable while about 77% of the variance of the two-stage
test was reliable. Thus, assuming that error variance is uncorrelated,
from 42% to 53% of the reliable variance in the pyramidal test was common to
the two-stage test, while from 48% to 55% of the reliable variance in the
two-stage test was common to the pyramidal test. Further, the correlation between
the two adaptive tests equalled or exceeded the internal consistency reliabilities 
of all the measurement tests and approached the internal consistency of the 
routing test when corrected for length. 
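The variance percentages in this paragraph can be retraced from the reported coefficients. The sketch below squares each correlation to obtain a proportion of variance. The final step, multiplying the rounded common-variance and reliable-variance proportions, is an inference from the arithmetic: it reproduces the reported 42%-53% and 48%-55% ranges but is not a formula stated in the text. Variable names are illustrative only.

```python
def variance_share(r):
    """Squared correlation, rounded to two decimals as in the text."""
    return round(r ** 2, 2)


common = (variance_share(0.79), variance_share(0.84))     # between-test r
pyramidal = (variance_share(0.82), variance_share(0.86))  # test-retest r
two_stage = variance_share(0.88)                          # test-retest r

print(common, pyramidal, two_stage)

# Assumed reconstruction: products of the rounded proportions, in percent.
pyramidal_common = tuple(round(c * p * 100) for c, p in zip(common, pyramidal))
two_stage_common = tuple(round(c * two_stage * 100) for c in common)

print(pyramidal_common)  # (42, 53)
print(two_stage_common)  # (48, 55)
```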

Several tentative conclusions can be drawn from these results. First,
the results replicate previous findings which indicate that the order of 
administration of adaptive tests does not significantly affect scores on the 
tests. Consequently, research on different adaptive strategies can proceed
by administering two or more strategies successively to an individual without 
randomizing administration order. 

The results seem to support previous findings by Lord (1970) and Larkin 
and Weiss (1974) which indicate that the average difficulty scores are the 
most useful way of scoring pyramidal tests. Lord's results indicate that his
average difficulty score provides the most desirable information functions
while Larkin and Weiss' results indicate that these scores are the most stable
over short time intervals. And, in the present study, the mean-difficulty-
correct score gave results which deviated least from a normal distribution.
Although the distribution of ability in the subjects was unknown, this
agreement of results across these studies implies that it is not unreasonable to
assume that it was normal. Further research is needed, however, with
populations of known distribution of ability, to support this assumption.

The data on score means for the two adaptive strategies suggest that few 
chance successes occurred, on the average, as the result of guessing. These
results support Hansen's (1969) finding that decreases in guessing do occur
when item difficulties are adapted to each individual's ability level. There
was a suggestion in the data that the pyramidal strategy appeared to result in
fewer chance successes due to guessing than did the two-stage strategy. This
finding should also be further studied by research designed specifically to
answer that question.

Finally, the results suggest that the two adaptive strategies are not 
replacements for each other in terms of measuring the same variable in the 
same way. When the correlation between scores on the two adaptive strategies 
was considered with respect to available data on the reliabilities of the
strategies, only about 50% of the reliable variance of the two strategies
was found to be common. Thus, each strategy orders individuals differently 
on estimated ability. Further research is needed to determine the reasons 
for these different ability estimates. 

Thus, a deficiency of the present study concerns the determination of the 
relative efficiency of the two testing strategies. The use of live subjects 
does not permit any estimation of the precision or accuracy of the scores 
obtained under either strategy, since the "true" ability of the testees was, 
of course, unknown. Thus, the degree to which test scores accurately reflected
underlying ability could not be determined. Live-testing empirical studies 
designed to answer this question will require very large samples of testees. 
Theoretical studies, as shown by Weiss and Betz (1974), appear to provide 
results which are not generalizable beyond those conditions satisfying their
restrictive assumptions. Thus, additional simulation studies (e.g., Betz & 
Weiss, 1974) seem to be necessary to determine which adaptive tests scored by 
which method provide most accurate measurement for testees of various ability 
levels. The simulation studies should then be followed by live-testing 
studies to validate the simulation findings. 